Challenges in Kurdish Text Processing

نویسنده

  • Kyumars Sheykh Esmaili
چکیده

Despite having a large number of speakers, the Kurdish language is among the less-resourced languages. In this work we highlight the challenges and problems in providing the required tools and techniques for processing texts written in Kurdish. From a high-level perspective, the main challenges are: the inherent diversity of the language, standardization and segmentation issues, and the lack of language resources.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

Resource scarcity along with diversity– both in dialect and script–are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by (i) building a text corpus for Sorani and Kurmanji, the two main dialects of Kurdish, and (ii) highlighting some of the orthographic, phonological, and morphological differences between these two dialects from ...

متن کامل

Stemming for Kurdish Information Retrieval

Resource scarcity along with diversity –in both dialect and script– are the two primary challenges in Kurdish language processing. In this paper we aim at addressing these two problems by building stemmers for the two main dialects of the Kurdish language (i.e. Sorani and Kurmanji) and investigate their effectiveness on Kurdish Information Retrieval. More specifically, we build Jedar, the first...

متن کامل

Towards Building KurdNet, the Kurdish WordNet

In this paper we highlight the main challenges in building a lexical database for Kurdish, a resource-scarce and diverse language. We also report on our effort in building the first prototype of KurdNet – the Kurdish WordNet– along with a preliminary evaluation of its impact on Kurdish information retrieval.

متن کامل

Fast Development of Basic NLP Tools: Towards a Lexicon and a POS Tagger for Kurmanji Kurdish

The development of basic NLP resources for minority languages is still a challenge to both formal and computational linguists. In this paper, we show how we were able to develop a medium-scale morphological lexicon for Kurmanji Kurdish in a few days time using only freely accessible resources. We also developed a preliminary POS tagger that shall be used as a pre-annotation tool for developing ...

متن کامل

Tokenizing an Arabic Script Language

In any natural language processing project, the input text needs to undergo tokenization before morphological analysis or parsing. For Arabic script languages the tokenization process faces more problems and it plays a more crucial role in natural language processing (NLP) systems for Arabic script languages. In this work we elaborate on some of these problems and present solutions for these. T...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1212.0074  شماره 

صفحات  -

تاریخ انتشار 2012